Group 10 members:

  • Nathan Dawson - s3381089
  • Aaron Schmid - s3784832
  • Sam Breznikar - s3413437
  • Rohit Menon - s3544222
  • Rafael Perez - s3769737

Predicting Diamond Price with Attribute Data

Introduction

The purpose of this study is to build a model that predicts diamond price from the attribute data of 53,940 round-cut diamonds. The dataset used is from Dr Vural Aksakalli's GitHub: https://github.com/vaksakalli/datasets/blob/master/diamonds.csv (Aksakalli, 2019).

Dr Aksakalli provides documentation for this dataset at: https://github.com/vaksakalli/datasets/blob/master/github_datasets_desc.ipynb (Aksakalli, 2019). More detailed descriptions of this dataset's features can be found at https://ggplot2.tidyverse.org/reference/diamonds.html (Wickham & Co., 2015).

Dr Aksakalli's dataset appears to be a mirror of the original diamonds dataset from ggplot2, which can be found at https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv (Wickham & Co., 2015).

This report is structured into sections as described below:

  • Overview provides details about the dataset used, and various features within the dataset.
  • Data Preparation provides details about steps taken to clean and prepare the dataset for analysis.
  • Data Exploration provides more in-depth details about the dataset's features, and how they are related to each other.
  • Statistical Modeling and Performance Evaluation details one full and three partial linear regression models, as well as diagnostic checks for each model.
  • Summary and Conclusions presents the final findings of this study and describes the methodology used.

Overview

Data Source

We used the diamonds.csv file from Dr Aksakalli's GitHub for this analysis. The file contains all feature details and a total of 53,940 observations. Extended variable descriptions for this dataset were found in the original ggplot2 diamonds reference documentation.

Project Objective

The objective of this project is to observe whether there is a relationship between the price of a diamond and its features, and in turn attempt to predict the price of a diamond based on those characteristics.

Target Feature

The target feature is price, a continuous numerical feature. Because the target feature is numeric (we are trying to predict price), this is a linear regression problem.

Descriptive Features

Variable descriptions in diamonds.csv:

  • carat: Continuous.
  • cut: Fair, Good, Very Good, Premium, Ideal. (total 5 categories)
  • color: D, E, F, G, H, I, J. (total 7 categories)
  • clarity: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. (total 8 categories)
  • depth: Continuous
  • table: Continuous
  • x: Continuous
  • y: Continuous
  • z: Continuous
  • price: Continuous

Carat, cut and price variables are self-explanatory. Other less obvious variable explanations, as described in the ggplot2 reference documentation, are:

  • color rates the diamond's colour from D (best) to J (worst).
  • clarity rates the diamond's clarity from I1 (worst) to IF (best).
  • x is the diamond's length in mm.
  • y is the diamond's width in mm.
  • z is the diamond's depth in mm.
  • depth is the total depth percentage of the diamond = z / mean(x, y) = 2 * z / (x + y)
  • table is the width of the diamond's flat top facet relative to its widest point.

It is important to note that depth is fundamentally different from z, despite both of them measuring "depth". z measures the absolute top-to-bottom size of the stone in mm, whereas depth (also known as "depth percentage") is expressed as a percentage and measures the stone's height relative to its average diameter, indicating whether a diamond is overweight or underweight relative to its diameter.
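The depth formula can be checked with quick arithmetic against one observation from the dataset (a row with x = 4.73, y = 4.77, z = 2.89 and a recorded depth of 60.8):

```python
# Verify depth = 2 * z / (x + y), expressed as a percentage,
# for one observation from the dataset
x, y, z = 4.73, 4.77, 2.89
depth_pct = round(2 * z / (x + y) * 100, 1)
print(depth_pct)  # 60.8, matching the recorded depth value
```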


Data Preparation

Preliminaries

We begin by importing required modules and setting global variables. We also assign constant SAMPLE_SIZE.

In [0]:
# Import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import warnings
import patsy
from IPython.display import display, HTML
warnings.filterwarnings('ignore')
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn-white")
sns.set(style="dark", color_codes=True)

SAMPLE_SIZE = 10000

Next, we import the data (.csv file) from its source URL and create a DataFrame. The .csv already has a header, so there is no need to specify column (attribute) names.

In [2]:
url = ('https://raw.githubusercontent.com/vaksakalli/datasets/master/diamonds.csv')

diamond_data = pd.read_csv(url, sep = ',', header = 0)

diamond_data.sample(15, random_state=99)
Out[2]:
carat cut color clarity depth table x y z price
42653 0.40 Premium G IF 60.8 58.0 4.73 4.77 2.89 1333
4069 0.31 Very Good D SI1 63.0 55.0 4.32 4.34 2.73 571
27580 1.60 Ideal F VS2 62.0 55.0 7.49 7.54 4.66 18421
33605 0.31 Good D SI2 63.1 54.0 4.33 4.38 2.75 462
34415 0.30 Ideal G IF 61.2 57.0 4.35 4.38 2.67 863
46932 0.52 Premium G VS1 62.4 60.0 5.12 5.11 3.19 1815
52243 0.80 Very Good J VS1 62.7 58.0 5.91 5.95 3.72 2487
38855 0.40 Ideal G VVS2 62.4 56.0 4.74 4.72 2.95 1050
38362 0.33 Ideal D VVS2 61.1 56.0 4.46 4.44 2.72 1021
20258 1.72 Ideal J SI1 62.0 57.0 7.66 7.62 4.74 8688
37444 0.51 Very Good J VS2 61.9 61.0 5.09 5.12 3.16 984
32912 0.33 Ideal F VVS2 61.9 56.0 4.42 4.46 2.75 810
11992 1.14 Ideal H SI1 61.6 57.0 6.68 6.73 4.13 5146
45683 0.50 Ideal F VS1 61.2 56.0 5.13 5.17 3.15 1695
17521 1.04 Very Good G VS1 61.8 58.0 6.50 6.55 4.03 7049

Data Cleaning and Transformation

We now check the data types of each feature, ensuring they match the descriptions found in the ggplot2 documentation. As this dataset contains 53,940 observations, for ease of computation we will work with a random sample of the data, subsetting to a working sample of 10,000 random observations.

In [3]:
# Subset original dataset
diamond_data_sample = diamond_data.sample(n=SAMPLE_SIZE, random_state=479)

# Print shape and data types
print("The shape of this sampled dataset:", diamond_data_sample.shape)
print("Data types for each feature shown below with 'object' being a string type.\n")
print(diamond_data_sample.dtypes)
The shape of this sampled dataset: (10000, 10)
Data types for each feature shown below with 'object' being a string type.

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
x          float64
y          float64
z          float64
price        int64
dtype: object

We can see there are a total of 7 numeric and 3 categorical features.

Checking for Missing Values

In [4]:
print("Number of missing values for each feature:\n")
print(diamond_data_sample.isnull().sum())
Number of missing values for each feature:

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64

There are no NaN values for any attribute. We will later check for any incorrectly named or labelled values.

Summary Statistics

In [5]:
# Use int64 and float64 types to include all continuous variables
display(HTML('<b>Table 1: Summary of continuous features</b>'))
diamond_data_sample.describe(include=['int64', 'float64'])
Table 1: Summary of continuous features
Out[5]:
carat depth table x y z price
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 0.801641 61.729580 57.455370 5.739420 5.747031 3.542815 3969.075700
std 0.476076 1.419893 2.226788 1.127832 1.239632 0.698546 4009.979502
min 0.200000 43.000000 49.000000 0.000000 0.000000 0.000000 335.000000
25% 0.400000 61.000000 56.000000 4.710000 4.720000 2.910000 947.000000
50% 0.700000 61.800000 57.000000 5.700000 5.715000 3.520000 2402.500000
75% 1.050000 62.500000 59.000000 6.550000 6.550000 4.040000 5383.750000
max 4.500000 71.200000 73.000000 10.230000 58.900000 8.060000 18791.000000
In [6]:
# Use object type to include all categorical variables 
display(HTML('<b>Table 2: Summary of categorical features</b>'))
diamond_data_sample.describe(include='object')
Table 2: Summary of categorical features
Out[6]:
cut color clarity
count 10000 10000 10000
unique 5 7 8
top Ideal G SI1
freq 4022 2159 2362

Observation of Table 2 shows us that cut, color and clarity have 5, 7, and 8 categories respectively, which matches the totals for each variable described previously in Overview - Descriptive Features. This means all categorical variable observation data is correctly labelled within the dataset (no cardinality issues), and we do not need to fix or amend the data for these variables.

Continuous Features

All continuous variables of our dataset would intuitively appear to have predictive power (an effect on price), so we will leave them all as is. The depth variable is derived from x, y, and z, and table is likewise a relative (percentage) measure. At this stage we can't tell whether removing depth and table will affect our predictions, so let's leave them in for the full model.

Categorical Features

We examine the unique values of each of the categorical variables.

In [7]:
cat_vars = ['cut', 'clarity', 'color']

for var in cat_vars:
  print("Unique values for", var)
  print(diamond_data_sample[var].unique(), "\n")
Unique values for cut
['Good' 'Premium' 'Ideal' 'Very Good' 'Fair'] 

Unique values for clarity
['VS1' 'SI2' 'VS2' 'SI1' 'VVS1' 'VVS2' 'IF' 'I1'] 

Unique values for color
['H' 'J' 'I' 'F' 'E' 'G' 'D'] 

Fortunately, our variables are already formatted nicely, and do not require any further cleanup.
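Since cut and clarity (and color) each carry a natural quality ordering, one optional refinement, not applied in this analysis, would be to encode them as ordered pandas Categoricals. A minimal sketch on a toy frame, with the orderings taken from the ggplot2 documentation (worst to best):

```python
import pandas as pd

# Quality orderings from the ggplot2 documentation (worst -> best)
cut_levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
clarity_levels = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Toy frame for illustration only
df = pd.DataFrame({'cut': ['Good', 'Ideal', 'Fair'],
                   'clarity': ['VS1', 'I1', 'IF']})
df['cut'] = pd.Categorical(df['cut'], categories=cut_levels, ordered=True)
df['clarity'] = pd.Categorical(df['clarity'], categories=clarity_levels,
                               ordered=True)

# Ordered categoricals support meaningful comparisons and sorting
print((df['cut'] >= 'Good').tolist())  # [True, True, False]
```

This would let later analysis sort or threshold on quality directly, though the plain string encoding is sufficient for the models below.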

Fixing Column Names

In [8]:
diamond_data_sample.head()
Out[8]:
carat cut color clarity depth table x y z price
7453 0.90 Good H VS1 64.1 57.0 6.07 6.04 3.88 4234
9620 1.28 Premium J SI2 62.3 58.0 6.94 6.89 4.31 4636
25090 2.01 Premium J VS2 60.4 59.0 8.10 8.15 4.91 13619
22581 1.57 Ideal I SI1 62.2 56.0 7.42 7.47 4.63 10633
52943 0.70 Very Good F VS2 63.3 56.0 5.62 5.66 3.57 2593

Much like our categorical variables, our column names are already formatted nicely and do not require any further cleanup.


Data Exploration

Univariate Visualisation

We decided to explore the price of diamonds using a line plot, boxplot, and histogram.

Line plot:

In [9]:
# create, format and display a line plot
plt.figure(figsize=(18,6))
p1 = diamond_data_sample['price'].value_counts().sort_index().plot(kind="line", color="royalblue")
plt.title("Figure 1: Line chart of Diamond price", fontsize = 16)
plt.xlabel('Price (USD)', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

The line plot above shows each diamond price in our random sample and the number of times it occurs. As can be seen, diamonds cluster at the lower end of the price range, around 1000 USD.

Boxplot for diamond price:

In [10]:
# create, format and display a boxplot
plt.figure(figsize=(18,6))
p2 = sns.boxplot(diamond_data_sample['price'], palette="Blues").set_title('Figure 2: Box Plot of Diamond Price', fontsize = 16)
plt.xlabel('Price (USD)', fontsize=16)
plt.show();

We can see from the boxplot (Figure 2) that the price feature is considerably right-skewed. The left whisker of the boxplot is equal to the minimum price while the right whisker is roughly at 12000, hence all diamond prices above 12000 USD can be considered outliers and therefore very rare.

It can also be observed that the middle 50% of diamonds are priced between roughly 950 USD and 5400 USD, with a median of roughly 2400 USD.
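The whisker position quoted above follows from the usual Tukey fence (Q3 + 1.5 × IQR), using the price quartiles reported in Table 1:

```python
# Upper whisker (Tukey fence) for price, using quartiles from Table 1
q1, q3 = 947.00, 5383.75
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print(upper_fence)  # 12038.875 -> prices above ~12000 USD plot as outliers
```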

Histogram for diamond price:

In [11]:
# create, format and display a histogram
plt.figure(figsize=(18,6))
p3 = sns.distplot(diamond_data_sample['price'], kde=True, bins=18).set_title('Figure 3: Histogram of Diamond Price', fontsize = 16)
plt.xlabel('Price (USD)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.show();

Echoing Figure 2, Figure 3 also demonstrates price's right skew, with a long right tail. The histogram shows the frequency of diamond prices, with each bin spanning roughly 1000 USD and starting at roughly 350 USD. It can easily be seen that the majority of diamonds fall within 350-5000 USD.

Multivariate Visualisation

Here, we decided to use pair plots.

With pair plots we visualise all variables and how they relate to and affect price.

Pair plots of Numeric features and Price

In [12]:
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["carat"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 4: Relationship Between Carat and Price", fontsize = 16)
plt.show()

It can be clearly seen from Figure 4 that as the diamond's carat increases, so does the price of said diamond. Although a little difficult to tell, the price most likely increases exponentially versus carat.

In [13]:
#Plot of Depth and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["depth"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 5: Relationship Between Depth % and Price", fontsize = 16)
plt.show()

Figure 5 shows no clear relationship between depth % and price (depth % is the depth of the diamond divided by its average diameter). This may be because, of the characteristics we are analysing, depth % has the least effect on price. If instead all characteristics except depth % were held identical, perhaps a clearer relationship would emerge.

In [14]:
#Plot of Table and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["table"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 6: Relationship Between Table and Price", fontsize = 16)
plt.show()

As with Figure 5, Figure 6 shows no clear relationship between price and the diamond's table %. (Table % is the width of the diamond's table divided by the width of the diamond.)

In [15]:
#Plot of X	and Price 
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["x"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 7: Relationship Between Diamond X (mm) and Price", fontsize = 16)
plt.show()
In [16]:
#Plot of Y and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["y"],y_vars=["price"])
g.set(xlim=(-1,11)) #Set limit on x-axis as there is a single point around x=60
g.fig.set_size_inches(18,6)
plt.title("Figure 8: Relationship Between Diamond Y (mm) and Price", fontsize = 16)
plt.show()
In [17]:
#Plot of Z and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["z"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 9: Relationship Between Diamond Z (mm) and Price", fontsize = 16)
plt.show()

As can be seen in Figures 7, 8 and 9, the price of a diamond is significantly affected by its dimensions, with each figure corresponding to the diamond's x, y and z dimensions respectively. It can clearly be seen that the price increases exponentially for an increase in any dimension.

Categorical Features and Price

In [18]:
#Plot of Clarity and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['clarity'],diamond_data_sample['price']);
plt.title('Figure 10: Relationship between Clarity and Price',fontsize = 16);
plt.show()

Figure 10 shows the relationship between the diamond's clarity and its price. From these boxplots, it can be clearly observed that the clarity of a diamond has a relatively significant effect on the price. The clarities VVS1 and IF appear to have the lowest price range among the diamonds in this dataset, while VS1, VS2 and SI2 tend to sit higher than the rest.

In [19]:
#Plot of Cut and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['cut'],diamond_data_sample['price']);
plt.title('Figure 11: Relationship between Cut and Price',fontsize = 16);
plt.show();

Figure 11 shows the relationship between the cut of a diamond and its price. From our data, the cut of the diamond does have some effect on the price, with some variation in prices across cuts. Although there is not much difference between the boxplots, it can be seen that overall the Premium cut has the highest prices.

In [20]:
#Plot of Color and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['color'],diamond_data_sample['price']);
plt.title('Figure 12: Relationship between Color and Price',fontsize = 16);
plt.show()

Figure 12 shows the relationship between the diamond's color and price. From the boxplots, there is notable variation in price depending on color. Color does not appear to have much effect at the lower end of the price range, but does have a significant effect at the higher end. J and I appear to have the highest prices, with I coming out on top.

In [21]:
#Plot of Cut, Color and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['cut'],diamond_data_sample['price'],
            hue = diamond_data_sample['color'], width = 0.8);
plt.title('Figure 13: Relationship between Cut, Color and Price',fontsize= 16);
plt.show();

Combining Figures 11 and 12, Figure 13 shows the relationship between the cut and color of a diamond and its price. This gives us a more detailed view of the relationship between a diamond's cut, color and price.

It can be seen that colors of diamonds can have different prices at differing cuts, with premium cut having the highest overall prices, and fair cut having the smallest range of prices.

In [22]:
# Plot of Clarity, Carat and Price
plt.figure(figsize=(18,6))
sns.scatterplot(diamond_data_sample['carat'], diamond_data_sample['price'], hue = diamond_data_sample['clarity'])
plt.title('Figure 14: Relationship between Clarity, Carat and Price', fontsize = 16);
plt.legend(loc = 'upper right')
plt.show();

Figure 14 takes the relationship between clarity and price from Figure 10 and adds carat to the relationship, giving a more detailed look at the effect clarity and carat have on price.

The scatter plot above shows the exponential relationship between carat and price from Figure 4 and divides it into the 8 different clarities. From the plot, it appears that I1 has the smallest increase in price as the carat increases, while IF and VVS1 have the highest.


Statistical Modeling and Performance Evaluation

Full Model

We fit a full linear regression model (using all available features) to begin with. As noted previously, depth and table are derived measures, but let's leave them in for now; we can remove them in a later model.

In [23]:
diamond_data_sample.head()
Out[23]:
carat cut color clarity depth table x y z price
7453 0.90 Good H VS1 64.1 57.0 6.07 6.04 3.88 4234
9620 1.28 Premium J SI2 62.3 58.0 6.94 6.89 4.31 4636
25090 2.01 Premium J VS2 60.4 59.0 8.10 8.15 4.91 13619
22581 1.57 Ideal I SI1 62.2 56.0 7.42 7.47 4.63 10633
52943 0.70 Very Good F VS2 63.3 56.0 5.62 5.66 3.57 2593
In [24]:
# build the formula string from feature names, sans target feature
formula_string_indep_vars = ' + '.join(diamond_data_sample.drop(columns='price').columns)
formula_string = 'price ~ ' + formula_string_indep_vars

# print the formula string (sanity check)
print('formula_string: ', formula_string)

# create model from sampled dataset
full_model = sm.formula.ols(formula=formula_string, data=diamond_data_sample)

# fit the model
fitted_full_model = full_model.fit()

# print a summary of the model
print(fitted_full_model.summary(), "\n")

print(f"Regression number of terms: {len(fitted_full_model.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model.rsquared_adj:.3f}")
formula_string:  price ~ carat + cut + color + clarity + depth + table + x + y + z
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     5362.
Date:                Sun, 27 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:29:01   Log-Likelihood:                -84192.
No. Observations:               10000   AIC:                         1.684e+05
Df Residuals:                    9976   BIC:                         1.686e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         2478.0053   1096.853      2.259      0.024     327.952    4628.059
cut[T.Good]        653.6726     76.139      8.585      0.000     504.424     802.921
cut[T.Ideal]       872.8819     75.684     11.533      0.000     724.525    1021.238
cut[T.Premium]     813.9203     72.803     11.180      0.000     671.212     956.629
cut[T.Very Good]   822.7854     72.933     11.281      0.000     679.821     965.749
color[T.E]        -214.2378     40.750     -5.257      0.000    -294.116    -134.360
color[T.F]        -292.9947     41.159     -7.119      0.000    -373.675    -212.315
color[T.G]        -474.4250     40.094    -11.833      0.000    -553.017    -395.833
color[T.H]       -1013.2402     42.934    -23.600      0.000   -1097.400    -929.081
color[T.I]       -1507.9777     47.830    -31.528      0.000   -1601.735   -1414.220
color[T.J]       -2404.3855     59.335    -40.522      0.000   -2520.694   -2288.077
clarity[T.IF]     5503.7019    114.800     47.942      0.000    5278.671    5728.733
clarity[T.SI1]    3839.8878     98.853     38.844      0.000    3646.116    4033.659
clarity[T.SI2]    2922.7658     99.126     29.485      0.000    2728.459    3117.073
clarity[T.VS1]    4792.4895    101.024     47.439      0.000    4594.462    4990.517
clarity[T.VS2]    4426.1674     99.392     44.532      0.000    4231.339    4620.996
clarity[T.VVS1]   5135.1729    107.047     47.971      0.000    4925.340    5345.006
clarity[T.VVS2]   5135.3766    103.998     49.380      0.000    4931.520    5339.233
carat             1.157e+04    109.125    106.023      0.000    1.14e+04    1.18e+04
depth              -64.1745     14.109     -4.548      0.000     -91.831     -36.518
table              -28.0726      6.682     -4.201      0.000     -41.171     -14.974
x                -1005.6104    111.337     -9.032      0.000   -1223.853    -787.368
y                   -7.4050     24.819     -0.298      0.765     -56.054      41.245
z                 -207.7815    185.183     -1.122      0.262    -570.778     155.215
==============================================================================
Omnibus:                     2823.253   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            75621.246
Skew:                           0.769   Prob(JB):                         0.00
Kurtosis:                      16.384   Cond. No.                     8.55e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.55e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 

Regression number of terms: 24
Regression F-distribution p-value: 0.0
Regression R-squared: 0.925
Regression Adjusted R-squared: 0.925

Now that we have our fitted full model, let's plot its predictions against the actual values to provide a visual representation of its accuracy.

In [25]:
def plot_line(axis, slope, intercept, **kargs):
    xmin, xmax = axis.get_xlim()
    plt.plot([xmin, xmax], [xmin*slope+intercept, xmax*slope+intercept], **kargs)
    
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 15: Scatterplot of price vs predicted price for Model 1', fontsize=16);
plt.show();

We found our full model's adjusted R-squared to be 0.925, indicating that about 92.5% of the variance in price is explained by the model.

We observe, from Figure 15, that the model tends to under-price diamonds from around the 10000 USD point onwards, with considerably more observations clustered below the y = x line than above in that range.

We noted earlier from Figures 2 and 3 that Price is right-skewed. From observation of the p-values, we ascertain that both the y and z features are insignificant, with y having the higher p-value.

Full Model Diagnostic Checks

In [26]:
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 16: Scatterplot of Residuals for Model 1', fontsize=16)
plt.show();

In Figure 16 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 17500-20000, x: 15000-17500).

In [27]:
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 17: Histogram of Residuals for Model 1', fontsize=16);
plt.show();

Our residuals histogram in Figure 17 looks relatively symmetric with a slight right-skew, indicating that the residuals somewhat approximate a normal distribution.
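The "slight right-skew" can be made precise with the sample skewness (the third standardised moment): positive values indicate a longer right tail. A minimal pure-Python sketch on toy residuals (illustrative values, not the model's actual residuals):

```python
import math

def sample_skewness(values):
    """Fisher's moment coefficient of skewness: E[(x - mean)^3] / sd^3."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)

# Toy residuals with a long right tail -> positive skewness
toy = [-2, -1, -1, 0, 0, 0, 1, 1, 2, 8]
print(sample_skewness(toy))  # positive, consistent with a right-skewed histogram
```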

Model 2

From the full model, we now remove y, as it had the highest p-value in the model summary.

In [28]:
# remove the y feature as it has the highest p-value
# build the formula string from feature names, sans target feature and y
formula_string_indep_vars_v2 = ' + '.join(diamond_data_sample.drop(columns=['price','y']).columns)
formula_string_v2 = 'price ~ ' + formula_string_indep_vars_v2

# print the formula string (sanity check)
print('formula_string: ', formula_string_v2)

# create model from sampled dataset
full_model_2 = sm.formula.ols(formula=formula_string_v2, data=diamond_data_sample)

# fit the model
fitted_full_model_2 = full_model_2.fit()

# print a summary of the model
print(fitted_full_model_2.summary(), "\n")

print(f"Regression number of terms: {len(fitted_full_model_2.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_2.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_2.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_2.rsquared_adj:.3f}")
formula_string:  price ~ carat + cut + color + clarity + depth + table + x + z
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     5606.
Date:                Sun, 27 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:29:04   Log-Likelihood:                -84192.
No. Observations:               10000   AIC:                         1.684e+05
Df Residuals:                    9977   BIC:                         1.686e+05
Df Model:                          22                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         2373.0648   1038.885      2.284      0.022     336.641    4409.488
cut[T.Good]        653.2817     76.125      8.582      0.000     504.062     802.501
cut[T.Ideal]       872.6327     75.676     11.531      0.000     724.292    1020.973
cut[T.Premium]     813.4591     72.783     11.176      0.000     670.789     956.129
cut[T.Very Good]   822.4979     72.924     11.279      0.000     679.553     965.443
color[T.E]        -214.2641     40.748     -5.258      0.000    -294.138    -134.390
color[T.F]        -293.0215     41.157     -7.120      0.000    -373.698    -212.345
color[T.G]        -474.4867     40.092    -11.835      0.000    -553.074    -395.899
color[T.H]       -1013.4384     42.927    -23.608      0.000   -1097.584    -929.293
color[T.I]       -1508.0771     47.827    -31.532      0.000   -1601.828   -1414.326
color[T.J]       -2404.2759     59.331    -40.523      0.000   -2520.577   -2287.975
clarity[T.IF]     5504.2735    114.779     47.956      0.000    5279.284    5729.263
clarity[T.SI1]    3840.3801     98.835     38.857      0.000    3646.644    4034.116
clarity[T.SI2]    2923.1834     99.112     29.494      0.000    2728.904    3117.462
clarity[T.VS1]    4793.0270    101.004     47.454      0.000    4595.040    4991.014
clarity[T.VS2]    4426.6995     99.372     44.547      0.000    4231.911    4621.488
clarity[T.VVS1]   5135.7066    107.027     47.985      0.000    4925.912    5345.501
clarity[T.VVS2]   5135.9077    103.978     49.394      0.000    4932.091    5339.725
carat             1.157e+04    109.119    106.028      0.000    1.14e+04    1.18e+04
depth              -62.5002     12.944     -4.828      0.000     -87.874     -37.126
table              -28.0457      6.681     -4.198      0.000     -41.142     -14.949
x                 -996.0374    106.609     -9.343      0.000   -1205.012    -787.063
z                 -235.2920    160.589     -1.465      0.143    -550.079      79.495
==============================================================================
Omnibus:                     2823.721   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            75533.483
Skew:                           0.770   Prob(JB):                         0.00
Kurtosis:                      16.376   Cond. No.                     8.06e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 

Regression number of terms: 23
Regression F-distribution p-value: 0.0
Regression R-squared: 0.925
Regression Adjusted R-squared: 0.925
In [29]:
# Creating scatter plot
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_2.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 18: Scatterplot of price against predicted price for Model 2', fontsize=16);
plt.show(); 

This model returns an adjusted R-squared of 0.925, meaning that roughly 92.5% of the variance can still be explained by the reduced model, with one fewer variable. Looking at the p-values, we can see that all terms are significant at the 5% level except z, which has a p-value of 0.143 and is therefore insignificant at that level.

Although both y and z had p-values greater than 0.05, only y was removed for this model, as its p-value was the highest.
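This one-at-a-time procedure (drop the single term with the highest p-value above 0.05, refit, repeat) can be sketched as a loop. The p-values below are taken from the full-model summary for illustration; in practice the model would be refit after each drop, changing the remaining p-values:

```python
# Backward elimination sketch: repeatedly drop the worst term until
# every remaining p-value is at or below the significance threshold.
def backward_eliminate(p_values, alpha=0.05):
    remaining = dict(p_values)
    dropped = []
    while remaining:
        worst = max(remaining, key=remaining.get)
        if remaining[worst] <= alpha:
            break
        dropped.append(worst)
        del remaining[worst]  # in practice: refit the model and recompute p-values
    return dropped, remaining

# Illustrative p-values from the full-model summary (continuous terms only)
pvals = {'carat': 0.000, 'depth': 0.000, 'y': 0.765, 'z': 0.262}
dropped, kept = backward_eliminate(pvals)
print(dropped)  # ['y', 'z'] -- removed worst p-value first
```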

Model 2 Diagnostic Checks

In [30]:
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_2.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 19: Scatterplot of Residuals for Model 2', fontsize=16)
plt.show();

In Figure 19 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).

In [31]:
residuals2 = diamond_data_sample['price'] - fitted_full_model_2.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals2, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 20: Histogram of residuals for Model 2', fontsize = 16)
plt.show();

Our residuals histogram in Figure 20 again looks relatively symmetric with a slight right-skew. This indicates that the residuals somewhat approximate a normal distribution.
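The visual impression of approximate normality can be checked formally; the model summary above already reports a Jarque-Bera statistic, and the same test can be run directly on any residual vector. A sketch on synthetic residuals (not the model's actual residuals):

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(42)
normal_resid = rng.normal(size=10_000)           # well-behaved residuals
skewed_resid = rng.exponential(size=10_000) - 1  # right-skewed, like Figure 20

# jarque_bera returns (statistic, p-value, skew, kurtosis)
jb_stat, jb_pvalue, skew, kurt = jarque_bera(normal_resid)
print(f"normal resid: JB={jb_stat:.1f}, p={jb_pvalue:.3f}, skew={skew:.3f}")

jb_stat_s, jb_pvalue_s, skew_s, kurt_s = jarque_bera(skewed_resid)
print(f"skewed resid: JB={jb_stat_s:.1f}, p={jb_pvalue_s:.3g}, skew={skew_s:.3f}")
```

A small p-value rejects normality, which is why the very large JB values in the summaries above flag the heavy tails seen in the histograms.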

Model 3

Continuing on from the previous model, we then removed z as it was the only other variable with a p-value greater than 0.05.

In [32]:
# remove the z feature, from the previous model, as it has the next highest p-value
# build the formula string from feature names, sans target feature, y and z
formula_string_indep_vars_v3 = ' + '.join(diamond_data_sample.drop(columns=['price','y','z']).columns)
formula_string_v3 = 'price ~ ' + formula_string_indep_vars_v3

# print the formula string (sanity check)
print('formula_string: ', formula_string_v3)

# create model from sampled dataset
full_model_3 = sm.formula.ols(formula=formula_string_v3, data=diamond_data_sample)

# fit the model
fitted_full_model_3 = full_model_3.fit()

# print a summary of the model
print(fitted_full_model_3.summary(), "\n")

print(f"Regression number of terms: {len(fitted_full_model_3.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_3.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_3.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_3.rsquared_adj:.3f}")
formula_string:  price ~ carat + cut + color + clarity + depth + table + x
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     5872.
Date:                Sun, 27 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:29:08   Log-Likelihood:                -84193.
No. Observations:               10000   AIC:                         1.684e+05
Df Residuals:                    9978   BIC:                         1.686e+05
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         3156.0847    890.937      3.542      0.000    1409.668    4902.502
cut[T.Good]        653.1373     76.129      8.579      0.000     503.909     802.365
cut[T.Ideal]       873.5628     75.678     11.543      0.000     725.219    1021.907
cut[T.Premium]     816.6711     72.754     11.225      0.000     674.058     959.284
cut[T.Very Good]   821.4838     72.925     11.265      0.000     678.537     964.431
color[T.E]        -214.1870     40.750     -5.256      0.000    -294.065    -134.309
color[T.F]        -292.9960     41.159     -7.119      0.000    -373.677    -212.315
color[T.G]        -473.8633     40.092    -11.820      0.000    -552.451    -395.276
color[T.H]       -1013.4325     42.930    -23.607      0.000   -1097.583    -929.282
color[T.I]       -1506.6215     47.820    -31.506      0.000   -1600.358   -1412.885
color[T.J]       -2404.7763     59.334    -40.530      0.000   -2521.082   -2288.470
clarity[T.IF]     5496.6524    114.667     47.936      0.000    5271.881    5721.423
clarity[T.SI1]    3833.9347     98.742     38.828      0.000    3640.380    4027.490
clarity[T.SI2]    2916.3903     99.009     29.456      0.000    2722.313    3110.468
clarity[T.VS1]    4786.1057    100.899     47.435      0.000    4588.324    4983.888
clarity[T.VS2]    4419.9892     99.272     44.524      0.000    4225.397    4614.582
clarity[T.VVS1]   5128.7909    106.929     47.965      0.000    4919.189    5338.393
clarity[T.VVS2]   5128.8821    103.873     49.376      0.000    4925.270    5332.494
carat             1.156e+04    108.926    106.127      0.000    1.13e+04    1.18e+04
depth              -75.6902      9.302     -8.137      0.000     -93.924     -57.456
table              -27.7001      6.678     -4.148      0.000     -40.789     -14.611
x                -1137.0088     45.918    -24.762      0.000   -1227.017   -1047.001
==============================================================================
Omnibus:                     2823.301   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            75767.792
Skew:                           0.769   Prob(JB):                         0.00
Kurtosis:                      16.397   Cond. No.                     6.88e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.88e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 

Regression number of terms: 22
Regression F-distribution p-value: 0.0
Regression R-squared: 0.925
Regression Adjusted R-squared: 0.925
In [33]:
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_3.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 21: Scatterplot of price vs predicted price for Model 3', fontsize=16);
plt.show();

As with the previous model, after removing z in addition to y, the adjusted R-squared remains unchanged at 0.925, meaning that roughly 92% of the variance is still explained by the reduced model with two fewer variables.

With y having already been removed, z was the only remaining variable with a p-value greater than 0.05 and was therefore removed for this model.
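Because each reduced model is nested in the one before it, dropping a variable can also be justified with a partial F-test rather than individual p-values. A hedged sketch using statsmodels' `anova_lm` on synthetic data (not the diamonds sample; column names are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({'a': rng.normal(size=n), 'b': rng.normal(size=n)})
df['price'] = 5 * df['a'] + rng.normal(size=n)  # 'b' is irrelevant to price

reduced = smf.ols('price ~ a', data=df).fit()
full = smf.ols('price ~ a + b', data=df).fit()

# anova_lm compares the nested fits: a large Pr(>F) means dropping 'b'
# loses essentially no explanatory power
table = anova_lm(reduced, full)
print(table)
```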

Model 3 Diagnostic Checks

In [34]:
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_3.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 22: Scatterplot of Residuals for Model 3', fontsize=16)
plt.show();

In Figure 22 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 5000 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).

In [35]:
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model_3.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 23: Histogram of Residuals for Model 3', fontsize=16);
plt.show();

Our residuals histogram in Figure 23 yet again looks relatively symmetric with a slight right-skew. This indicates that the residuals somewhat approximate a normal distribution.

Model 4

Next we decided to try removing both table and depth from the previous model, as both are proportions derived from the diamond's physical dimensions (depth is the total depth percentage computed from x, y and z, while table describes the width of the top facet relative to the widest point). We removed them one at a time, starting with table.
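For reference, the ggplot2 documentation defines depth as the total depth percentage, z / mean(x, y) = 2 * z / (x + y), so it carries little information beyond the dimension columns themselves. The relationship is easy to verify numerically; a sketch with made-up dimensions (not rows from the dataset):

```python
import numpy as np

# hypothetical diamond dimensions in mm (not taken from diamonds.csv)
x = np.array([3.95, 4.05, 4.20])  # length
y = np.array([3.98, 4.07, 4.23])  # width
z = np.array([2.43, 2.31, 2.63])  # depth

# total depth percentage, per the ggplot2 reference documentation
depth_pct = 2 * z / (x + y) * 100
print(np.round(depth_pct, 1))
```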

In [36]:
# remove the Table feature as per comments throughout (because it is derivative of X, Y, Z)
# build the formula string from feature names, sans target feature, y, z and table
formula_string_indep_vars_v4 = ' + '.join(diamond_data_sample.drop(columns=['price','y','z','table']).columns)
formula_string_v4 = 'price ~ ' + formula_string_indep_vars_v4

# print the formula string (sanity check)
print('formula_string: ', formula_string_v4)

# create model from sampled dataset
full_model_4 = sm.formula.ols(formula=formula_string_v4, data=diamond_data_sample)

# fit the model
fitted_full_model_4 = full_model_4.fit()

# print a summary of the model
print(fitted_full_model_4.summary(), "\n")

print(f"Regression number of terms: {len(fitted_full_model_4.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_4.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_4.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_4.rsquared_adj:.3f}")
formula_string:  price ~ carat + cut + color + clarity + depth + x
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     6155.
Date:                Sun, 27 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:29:11   Log-Likelihood:                -84202.
No. Observations:               10000   AIC:                         1.684e+05
Df Residuals:                    9979   BIC:                         1.686e+05
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          513.6189    623.394      0.824      0.410    -708.360    1735.598
cut[T.Good]        684.9274     75.804      9.036      0.000     536.337     833.518
cut[T.Ideal]       991.1004     70.230     14.112      0.000     853.434    1128.766
cut[T.Premium]     867.0291     71.793     12.077      0.000     726.301    1007.757
cut[T.Very Good]   884.5315     71.381     12.392      0.000     744.610    1024.453
color[T.E]        -216.4926     40.779     -5.309      0.000    -296.428    -136.557
color[T.F]        -295.2270     41.189     -7.168      0.000    -375.966    -214.488
color[T.G]        -473.5800     40.124    -11.803      0.000    -552.231    -394.929
color[T.H]       -1015.3462     42.962    -23.634      0.000   -1099.560    -931.132
color[T.I]       -1508.4855     47.856    -31.521      0.000   -1602.294   -1414.678
color[T.J]       -2404.0618     59.382    -40.485      0.000   -2520.462   -2287.662
clarity[T.IF]     5507.0471    114.733     47.999      0.000    5282.147    5731.947
clarity[T.SI1]    3831.2753     98.820     38.770      0.000    3637.567    4024.983
clarity[T.SI2]    2917.7842     99.089     29.446      0.000    2723.550    3112.018
clarity[T.VS1]    4787.8290    100.980     47.414      0.000    4589.888    4985.770
clarity[T.VS2]    4420.6294     99.352     44.495      0.000    4225.879    4615.380
clarity[T.VVS1]   5135.2849    107.004     47.991      0.000    4925.535    5345.035
clarity[T.VVS2]   5133.3979    103.952     49.383      0.000    4929.632    5337.164
carat             1.154e+04    108.926    105.960      0.000    1.13e+04    1.18e+04
depth              -59.9775      8.503     -7.054      0.000     -76.645     -43.310
x                -1133.7924     45.948    -24.675      0.000   -1223.861   -1043.724
==============================================================================
Omnibus:                     2834.160   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            74134.485
Skew:                           0.782   Prob(JB):                         0.00
Kurtosis:                      16.247   Cond. No.                     3.55e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.55e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 

Regression number of terms: 21
Regression F-distribution p-value: 0.0
Regression R-squared: 0.925
Regression Adjusted R-squared: 0.925
In [37]:
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_4.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 24: Scatterplot of price vs predicted price for Model 4', fontsize=16);
plt.show();

After removing table from the model, the adjusted R-squared still remains unchanged at 0.925, meaning that roughly 92% of the variance is still explained by the reduced model, now with three fewer variables.

This also suggests that, as we expected, table (and most likely depth) contributes little to the model.

Model 4 Diagnostic Checks

In [38]:
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_4.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 25: Scatterplot of Residuals for Model 4', fontsize=16)
plt.show();

In Figure 25 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).

In [39]:
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model_4.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 26: Histogram of Residuals for Model 4', fontsize=16);
plt.show();

After removing table, our residuals histogram in Figure 26 still looks symmetric with a slight right-skew. This indicates that the residuals somewhat approximate a normal distribution.

Model 5

Finally, as mentioned previously, we removed depth from the model, as we were quite confident it would have no effect.

In [40]:
# remove the Depth feature as per comments throughout (because like table, it is derivative of X, Y, Z)
# build the formula string from feature names, sans target feature, y, z, table and depth
formula_string_indep_vars_v5 = ' + '.join(diamond_data_sample.drop(columns=['price','y','z','table','depth']).columns)
formula_string_v5 = 'price ~ ' + formula_string_indep_vars_v5

# print the formula string (sanity check)
print('formula_string: ', formula_string_v5)

# create model from sampled dataset
full_model_5 = sm.formula.ols(formula=formula_string_v5, data=diamond_data_sample)

# fit the model
fitted_full_model_5 = full_model_5.fit()

# print a summary of the model
print(fitted_full_model_5.summary(), "\n")

print(f"Regression number of terms: {len(fitted_full_model_5.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_5.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_5.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_5.rsquared_adj:.3f}")
formula_string:  price ~ carat + cut + color + clarity + x
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     6445.
Date:                Sun, 27 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:29:15   Log-Likelihood:                -84227.
No. Observations:               10000   AIC:                         1.685e+05
Df Residuals:                    9980   BIC:                         1.686e+05
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept        -3642.3945    204.155    -17.841      0.000   -4042.579   -3242.210
cut[T.Good]        758.5765     75.264     10.079      0.000     611.043     906.110
cut[T.Ideal]      1102.6197     68.595     16.074      0.000     968.160    1237.079
cut[T.Premium]    1006.9670     69.165     14.559      0.000     871.389    1142.545
cut[T.Very Good]   991.9589     69.908     14.190      0.000     854.926    1128.992
color[T.E]        -218.3064     40.878     -5.340      0.000    -298.436    -138.177
color[T.F]        -301.5355     41.280     -7.305      0.000    -382.453    -220.618
color[T.G]        -481.1003     40.208    -11.965      0.000    -559.916    -402.285
color[T.H]       -1027.1736     43.034    -23.869      0.000   -1111.529    -942.818
color[T.I]       -1518.1499     47.953    -31.659      0.000   -1612.148   -1424.152
color[T.J]       -2409.2112     59.522    -40.476      0.000   -2525.886   -2292.536
clarity[T.IF]     5560.9256    114.758     48.458      0.000    5335.977    5785.874
clarity[T.SI1]    3859.1446     98.982     38.988      0.000    3665.119    4053.170
clarity[T.SI2]    2951.6025     99.214     29.750      0.000    2757.123    3146.082
clarity[T.VS1]    4826.9113    101.074     47.756      0.000    4628.786    5025.036
clarity[T.VS2]    4455.4242     99.472     44.791      0.000    4260.439    4650.409
clarity[T.VVS1]   5181.7155    107.062     48.399      0.000    4971.852    5391.579
clarity[T.VVS2]   5178.9636    104.004     49.796      0.000    4975.095    5382.832
carat             1.136e+04    106.148    107.036      0.000    1.12e+04    1.16e+04
x                -1054.1322     44.648    -23.610      0.000   -1141.651    -966.614
==============================================================================
Omnibus:                     2838.321   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            68532.171
Skew:                           0.808   Prob(JB):                         0.00
Kurtosis:                      15.723   Cond. No.                         156.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 

Regression number of terms: 20
Regression F-distribution p-value: 0.0
Regression R-squared: 0.925
Regression Adjusted R-squared: 0.924
In [41]:
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_5.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 27: Scatterplot of price vs predicted price for Model 5', fontsize=16);
plt.show();

After removing depth from the model, the adjusted R-squared dropped slightly, from 0.925 to 0.924. The change is so small that roughly 92% of the variance is still explained by the reduced model, which now retains only five of the original nine explanatory variables.

As with table, we were fairly confident that removing depth would leave the model largely unaffected, and the result above bears this out.
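The AIC values in the summaries above point the same way: a near-identical fit with fewer parameters is rewarded. A sketch of that trade-off on synthetic data (the column names are made up, and the 2-point-per-parameter penalty is the standard AIC definition):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 2000
df = pd.DataFrame({'size': rng.normal(size=n)})
df['redundant'] = df['size'] + rng.normal(scale=0.01, size=n)  # near-copy of size
df['price'] = 10 * df['size'] + rng.normal(size=n)

lean = smf.ols('price ~ size', data=df).fit()
bloated = smf.ols('price ~ size + redundant', data=df).fit()

# R^2 barely moves, while AIC charges 2 points for the extra parameter
print(f"lean    R^2={lean.rsquared:.4f}  AIC={lean.aic:.1f}")
print(f"bloated R^2={bloated.rsquared:.4f}  AIC={bloated.aic:.1f}")
```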

Model 5 Diagnostic Checks

In [42]:
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_5.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 28: Scatterplot of Residuals for Model 5', fontsize=16)
plt.show();

In Figure 28 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).

In [43]:
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model_5.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 29: Histogram of Residuals for Model 5', fontsize=16);
plt.show();

Although there is a slight change this time, our residuals histogram in Figure 29 still looks symmetric with a slight right-skew. This indicates that the residuals somewhat approximate a normal distribution.


Summary and Conclusions

Of the five linear regression models tested, the best (fewest redundant features with essentially undiminished predictive power) is Model 5, which uses only the features carat, cut, color, clarity and x. These were the only features that demonstrated statistical significance at conventional levels, as evidenced by the R-squared value remaining unchanged at 0.925 across Models 2, 3 and 4 as each of the other features was removed.

To reinforce this, the only way to produce a substantial change in any model was to remove one of the features retained in Model 5. We did not consider the tiny difference in adjusted R-squared between Model 5 and the other models (0.001) to be meaningful, and therefore disregarded it.

Our best regression model, Model 5, predicts price reasonably accurately, to within roughly ±5000 USD, up to about the $15,000 mark, beyond which it tends to slightly over-predict, with more actual observations clustered below the regression line than above it.
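For completeness, a fitted formula model like Model 5 can price a new diamond directly, since statsmodels' `predict` accepts a DataFrame with the model's columns. The sketch below trains a toy stand-in model because `fitted_full_model_5` and its training sample are not reproduced here; in the notebook one would call `fitted_full_model_5.predict(...)` with carat, cut, color, clarity and x, and the new diamond's attribute values below are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy stand-in for the diamonds sample: one numeric and one categorical feature
rng = np.random.default_rng(3)
n = 400
train = pd.DataFrame({
    'carat': rng.uniform(0.2, 2.0, size=n),
    'cut': rng.choice(['Fair', 'Good', 'Ideal'], size=n),
})
bump = train['cut'].map({'Fair': 0, 'Good': 300, 'Ideal': 600})
train['price'] = 8000 * train['carat'] + bump + rng.normal(scale=200, size=n)

model = smf.ols('price ~ carat + cut', data=train).fit()

# price a single hypothetical diamond
new_diamond = pd.DataFrame({'carat': [1.0], 'cut': ['Ideal']})
predicted = model.predict(new_diamond)[0]
print(f"predicted price: {predicted:.0f}")
```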

From observing the x, y and z vs price plots (Figures 7, 8 and 9), we note that the x feature in Model 5 could likely be replaced with either y or z, as the plots of these three features against price are only trivially dissimilar, and the features represent very similar physical properties of each diamond (its dimension along each axis).
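That interchangeability claim amounts to x, y and z being almost perfectly correlated with one another. A quick sketch of how this could be checked (synthetic, near-proportional dimensions standing in for the real columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
x = rng.uniform(3.5, 9.0, size=1000)              # length in mm
y = x * rng.normal(1.0, 0.01, size=1000)          # width tracks length closely
z = 0.615 * x * rng.normal(1.0, 0.01, size=1000)  # depth is a near-constant fraction

# off-diagonal entries near 1 suggest the three carry the same information
corr = pd.DataFrame({'x': x, 'y': y, 'z': z}).corr()
print(corr.round(3))
```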

Thus, we have found that there is a significant relationship between the price of a diamond and the features in Model 5, and that we can predict the price of a given diamond best utilising Model 5's feature set.


References

Fisher, F. (2013). Diamond Depth Percentages In Round Brilliant Cut Diamonds [online]. Available at http://findmyrock.com/diamond-basics/diamond-cut/understanding-diamond-depth-percentages/ [Accessed 2019-11-09]

Aksakalli, V. (2019). datasets/diamonds.csv [online]. Available at https://github.com/vaksakalli/datasets/blob/master/diamonds.csv [Accessed 2019-11-09]

Wickham, H, Chang, W, Henry, L, Pedersen, TL, Takahashi, K, Wilke, C, Woo, K, Yutani, H. (2015) Prices of 50,000 round cut diamonds [online]. Available at https://ggplot2.tidyverse.org/reference/diamonds.html [Accessed 2019-11-09]

Wickham, H, Chang, W, Henry, L, Pedersen, TL, Takahashi, K, Wilke, C, Woo, K, Yutani, H. (2015) diamonds.csv [online]. Available at https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv [Accessed 2019-11-09]